A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature
Abstract
BACKGROUND Chemical compounds and drugs (together called chemical entities) mentioned in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably because large manually annotated corpora are lacking. To accelerate the development of such systems, the Spanish National Cancer Research Center (CNIO) and the University of Navarra organized a challenge on Chemical and Drug Named Entity Recognition (CHEMDNER). The CHEMDNER challenge comprises two subtasks: 1) Chemical Entity Mention recognition (CEM); and 2) Chemical Document Indexing (CDI). Our study proposes machine learning-based systems for the CEM task.

METHODS The 2013 CHEMDNER challenge organizers provided 10,000 UTF-8-encoded PubMed abstracts, manually annotated according to a predefined annotation guideline: a training set of 3,500 abstracts, a development set of 3,500 abstracts and a test set of 3,000 abstracts. On this data set we developed two machine learning-based systems for the CEM task, one based on conditional random fields (CRF) and one based on structured support vector machines (SSVM). We also investigated the effects of three types of word representation (WR) features, generated by Brown clustering, random indexing and skip-gram, on both systems. The performance of our systems was evaluated on the test set using scripts provided by the CHEMDNER challenge organizers. The primary evaluation measures were micro-averaged precision, recall and F-measure.

RESULTS Our best system was among the top-ranked systems, with an official micro F-measure of 85.05%. Fixing a bug caused by inconsistent features marginally improved the system's performance, to a micro F-measure of 85.20%.

CONCLUSIONS The SSVM-based CEM systems outperformed the CRF-based CEM systems when using the same features. Each type of WR feature was beneficial to the CEM task. Both the CRF-based and the SSVM-based systems performed better with all three types of WR features than with any single type.
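The micro-averaged measures named in the abstract pool true positives, false positives and false negatives over all abstracts before computing precision, recall and F-measure. The sketch below illustrates that calculation only; it is not the organizers' evaluation script. It assumes that each mention is represented as a hypothetical (start, end) character-offset pair and that a prediction counts as correct only when its offsets match a gold mention exactly. The function name `micro_prf` and the document identifiers are illustrative.

```python
from typing import Dict, Set, Tuple

# Assumption: a mention is a (start, end) character-offset pair within one abstract.
Span = Tuple[int, int]


def micro_prf(
    gold: Dict[str, Set[Span]],
    pred: Dict[str, Set[Span]],
) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F-measure) under exact-span matching."""
    tp = fp = fn = 0
    # Pool the counts over every document seen in either the gold or predicted set.
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # predicted spans that exactly match a gold span
        fp += len(p - g)   # predicted spans with no exact gold match
        fn += len(g - p)   # gold spans that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    gold = {"doc1": {(0, 5), (10, 15)}, "doc2": {(3, 8)}}
    pred = {"doc1": {(0, 5)}, "doc2": {(3, 8), (20, 25)}}
    print(micro_prf(gold, pred))
```

Because the counts are pooled before dividing, abstracts with many mentions weigh more heavily in the score than abstracts with few, which is what distinguishes micro-averaging from averaging per-document scores.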
Related resources
Recognizing Chemical Entities in Biomedical Literature using Conditional Random Fields and Structured Support Vector Machines
The Spanish National Cancer Research Center (CNIO) and University of Navarra organized a challenge on recognizing chemical compounds and drugs (chemical entities) in biomedical literature, which includes two individual subtasks: 1) chemical entity mention recognition (CEM); and 2) chemical document indexing (CDI). The challenge organizers manually annotated chemical entities in 10000 abstracts ...
CD-REST: a system for extracting chemical-induced disease relation in literature
Mining chemical-induced disease relations embedded in the vast biomedical literature could facilitate a wide range of computational biomedical applications, such as pharmacovigilance. The BioCreative V organized a Chemical Disease Relation (CDR) Track regarding chemical-induced disease relation extraction from biomedical literature in 2015. We participated in all subtasks of this challenge. In ...
UTH-CCB@BioCreative V CDR Task: Identifying Chemical-induced Disease Relations in Biomedical Text
This paper describes the system developed by the UTH-CCB team from the University of Texas Health Science Center at Houston (UTHealth), for the 2015 BioCreative V shared tasks of Track 3 on extraction of chemical disease relation (CDR). We participated in both tasks: Task A for “Disease Named Entity Recognition and Normalization (DNER)” and Task B for “Chemical-induced Diseases Relation Extract...
Conditional Random Fields and Support Vector Machines for Disorder Named Entity Recognition in Clinical Texts
We present a comparative study between two machine learning methods, Conditional Random Fields and Support Vector Machines for clinical named entity recognition. We explore their applicability to clinical domain. Evaluation against a set of gold standard named entities shows that CRFs outperform SVMs. The best F-score with CRFs is 0.86 and for the SVMs is 0.64 as compared to a baseline of 0.60.
Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks
The recognition of disease and chemical named entities in scientific articles is a very important subtask in information extraction in the biomedical domain. Due to the diversity and complexity of disease names, the recognition of named entities of diseases is rather tougher than those of chemical names. Although there are some remarkable chemical named entity recognition systems available onli...